A chromosome-level genome assembly of the Echiura Urechis unicinctus

Echiura is a distinctive family of unsegmented sausage-shaped marine worms whose phylogenetic relationship still needs strong evidence from the phylogenomic analysis. In this family, Urechis unicinctus is known for its high nutritional and medicinal value and adaptation to harsh intertidal conditions. Herein, we combined PacBio long-read, short-read Illumina and Hi-C sequencing, generating a high-quality chromosome-level genome assembly of U. unicinctus. The assembled genome spans ~1,138.6 Mb with a scaffold N50 of 68.3 Mb, of which 1,113.8 Mb (97.82%) were anchored into 17 pseudo-chromosomes. The BUSCO analysis demonstrated the completeness of the genome assembly and gene model prediction are 93.5% and 91.5%, respectively. A total of 482.1 Mb repetitive sequences, 21,524 protein-coding genes, 1,535 miRNAs, 3,431 tRNAs, 124 rRNAs, and 348 snRNAs were annotated. This study significantly improves the quality of U. unicinctus genome assembly, sets the footsteps for molecular breeding and further study in genome evolution, genetic and molecular biology of U. unicinctus.


Background & Summary
Echiura, commonly referred to as spoon worms, are bilaterally symmetrical and coelomate marine invertebrates with a sausage-shaped body living in burrows in the sediments.They possess annelid-like morphological features, including the ladder-like nervous system, the ultrastructure of cuticle and chaetae, and the larval nervous system segments 1 , however, they have secondarily lost segmentation as adults, providing a particularly important model for understanding the mechanism underlying segment formation and secondary loss 2 .Furthermore, the evolutionary relationship between the Echiura and Annelida is in a long-standing controversy.Given the lack of segmentation as adults, Echiura has generally been regarded as a separate phyla closely related to Annelida.Recently, some researchers advised the inclusion of Echiura into Annelida based on the increasing amounts of morphological, molecular phylogenetic and phylogenomic evidence, including expressed sequence tags, transcriptome, and mitochondrial genome [3][4][5][6][7][8] .As solid evidence for a better understanding of their deep-level evolutionary relationships, phylogenomic analysis from high-quality chromosome-level genome data of the Echiura is still lacking.
Urechis unicinctus (also called the penis fish or the fat innkeeper worm), belonging to the Echiura, is a deposit-feeding burrowing organism that inhabits the intertidal zones along the Korean and Japanese coast and Bohai Gulf on the northeast coast of China.The intertidal zones are peculiar and dynamic areas which are vulnerable to a host of stressors, like steep gradients in temperature and oxygen concentration, threats from pathogen infections, pollution, and toxic substances 9 .The U. unicinctuss without adaptive immunity can survive in such harsh environments, providing an exciting resource for investigating environmental adaptative evolution.In addition, this endemic Echiuran species has essential ecological and socioeconomic significance and has become an important cultured aquaculture species due to its desirable flavour, nutrient-rich and high medicinal values in Asian countries, especially in China, Japan, and Korea 10 .The first draft genome assembly of U. unicinctuss based on Illumina short reads was published in 2021 11 .However, due to the limitations of the sequencing technique and assembly algorithm, the genome assembly with a contig N50 length of 0.458 kb and scaffold N50 length of 0.517 kb remains highly fragmented (Table 1), which lags far behind the demand for further study of genetic and molecular in U. unicinctuss.Hence, a high-quality chromosome-scale genome assembly of U. unicinctuss are essential in elucidating its genome evolution and adaptive evolution, and providing theoretical support for the species' culture.
Here, we present a high-quality chromosome-scale genome assembly of U. unicinctus obtained by combining Illumina, PacBio, and high-throughput chromosome conformation capture (Hi-C) sequencing technology toolkits.The U. unicinctus genome, with a total size of ~1,138.6Mb, was assembled into 1,394 scaffolds (N50 = 68.3Mb).A total of 1,113.8Mb assembled sequences (97.82%) were further anchored to 17 pseudochromosomes (Fig. 1).The quality of the genome assembly is significantly higher than that of the previously published version (NCBI accession No. PRJNA603659), with contig N50 being ~1,160 times higher and scaffold N50 being ~135,472 times higher (Table 1) 11 .The completeness, accuracy, and contiguity of the genome assembly were evaluated by Benchmarking Universal Single-Copy Ortholog (BUSCO) analysis, Core Eukaryotic Genes Mapping Approach (CEGMA), re-alignment between clean Illumina reads and the genome assembly, and SNP identification.Of the assembled genome, 482.1 Mb (42.34%) were repetitive sequences with a dominance of DNA elements.Additionally, a total of 21,524 protein-coding genes were annotated, of which 99.5% could be functionally annotated.This chromosome-level genome assembly builds the foundation for the understanding of genome evolution and evolutionary adaption and provides a valuable tool for further studies on the genetic and molecular biology of U. unicinctus.

Methods
Samples collection and whole-genome sequencing.Adult U. unicinctus samples were obtained from the field of Xiyan, Yantai, Shandong Province, China (121°25'E, 37°56'N), and genomic DNA extracted from the muscle tissue was collected for whole-genome sequencing using a QIAGEN DNeasy Blood & TissueKit (QIAGEN, Shanghai, China).Paired-end Illumina sequence library with insert size 350 bp and 10× Genomics linked-read library were sequenced by Illumina HiSeq X Ten platform with 97.74 Gb of short-read sequencing data (Table 1).For long-read sequencing, a library with an insert of 20 kb was constructed using SMRTbell Template Prep Kits, followed by PacBio single-molecule real-time (SMRT) sequencing using Pacbio Sequel Platform (Pacific Biosciences, Menlo Park, USA), generating approximately 142.1 Gb of long-read raw data.
Transcriptome sequencing.For transcriptome sequencing, four tissues, including intestines, gonads, blood, and muscle, were sampled from the same individual and stored in liquid nitrogen.RNA was extracted from these tissues and used for transcriptome sequencing, respectively.The cDNA paired-end libraries were prepared and sequenced on an Illumina HiSeq X sequencer (Paired-end 350 bp reads).Approximately 26.51 Gb of clean data were yielded from the RNA-seq raw data after quality control using fastp v. 0.21.0 12 (Table 1).
Genome size estimation and assembly.Jellyfish v. 2.1.3method 13 with k-mer distribution was employed to calculate k-mer frequency (k = 17) based on the high-quality paired-end reads (with an insert size of 350 bp).
The distribution of 17-mer depends on the characteristic of the genome and follows a Poisson's distribution.The genome size was estimated to be 1,396.33MB with K-mer depth of 58.The genome heterozygosity and repeat ratio are 1.25% and 53.86%, respectively (Table 2).The WTDBG software v. 2.5, https://github.com/ruanjue/wtdbg)was used to assemble the contig of the U. unicinctus genome with parameters setting as '--node-drop 0.20 --node-len 2304 --node-max 150 -s 0.05 -e 3′.Then, Racon v. 1.3.1 14 with default parameters was used to correct errors of contigs assembly by PacBio data.The resulting contigs were connected to super-scaffolds by 10× Genomics linked-read data using the fragScaff software v. 140324 with parameters setting as '-maxCore 200 -m 3000 -q 30 -C 5' 15 .Lastly, pilon v. 1.22 with parameters setting as '-Xmx300G --diploid --threads 20' 16 was used to perform the second round of error correction with short paired-end reads generated from Illumina Hiseq X Ten Platforms.The total length of the contig assembly was 1130.4Mb with the contig N50 size of 528.1 Kb (Table 3).For the scaffolding step, SSPACE v. 3.0 17 was first used to construct scaffolds using HiSeq data from all the mate-pair libraries (2 kb, 5 kb, 10 kb and 20 kb).FragScaff v. 140324 was further applied to build superscaffolds using the barcoded sequencing reads, generating a genome with a scaffold N50 size of 1080.3Kb.The total length of this version is 1146.5 Mb.

Hi-C library construction, sequencing and pseudo-chromosome anchoring. The Hi-C library was
constructed by a standard protocol described previously with certain modifications 18 .Briefly, the mussel tissue of U. unicinctus was fixed with 1% formaldehyde solution in MS buffer (10 mM potassium phosphate, pH 7.0; 50 mM NaCl; 0.1 M sucrose), and the nuclei were enriched from flow-through and subsequently digested with HindIII restriction enzyme (NEB).Biotin-labeled DNAs were ligated and purified, followed by fragmenting to a size of 300-500 bp.After a quality control process, the constructed Hi-C library was sequenced on an Illumina HiSeq X Ten sequencer with paired-end 350 bp.In total, 159.47 Gb of high-quality Hi-C data with 132.89 × coverage was acquired (Table 1).

Annotation of repeats and non-coding RNA (ncRNA).
Homologous comparison and de novo prediction were applied to annotate the repeated sequences in the assembled genome.For homologous comparison, the RepeatMasker v. 4.0.7 20 and the associated RepeatProteinMask v. 4.05 21 were performed to align against Repbase database 22 .For ab initio prediction, LTR_FINDER v.1.07 23, RepeatScout v. 1.05 24 and RepeatModeler v. 1.05 25 were first used for de novo candidate database constructing of repetitive elements.The repeated sequences were annotated using RepeatMasker v. 4.0.7 Tandem repeat sequences were de novo predicted using TRF v. 4.07b 26 .In total, 482.1 Mb repetitive sequences were annotated, accounting for 42.34% of the assembled U. unicinctus genome (Table 6).Among the repetitive sequences, DNA transposons (DNA), long interspersed elements (LINE),    short interspersed nuclear elements (SINEs), and long terminal repeats (LTRs) accounted for 20.72%, 4.26%, 0.25%, and 10.60% of the whole genome, respectively (Table 7).For ncRNA annotation, the tRNAs were predicted using tRNAscan-SE v. 1.3.1 software 27 , and the rRNAs fragments were identified by searching against the Human rRNA database using BLAST with an E-value of 1E-10.Other ncRNAs, including microRNAs (miRNA) and small nuclear RNAs (snRNAs) were predicted by INFERNAL v. 1.1rc4 28 using Rfam database 29 .Finally, a total of 5,908 ncRNAs were annotated, including 1,535 miRNAs, 3,431 tRNAs, 124 rRNAs, and 348 snRNAs in U. unicinctus genome(Table 8).
For functional annotation, the predicted protein sequences were aligned against SwissProt 43 , NCBI's non-redundant protein sequence databases (NR), InterPro 44 , Gene Ontology (GO) 45 , Kyoto Encyclopedia of Genes and Genomes (KEGG) 46 and Pfam protein databases 47 by BLASTP (E-value ≤ 1E-05) with the matched rates of 74.1%, 93.2%, 74.2%, 98.5%, 90.6%, and 67.8%, respectively (Table 10).InterproScan tool 48 in coordination with InterPro database was applied to predict protein function based on the conserved protein domains and functional sites.In total, 21,408 genes were functionally annotated by at least one database, accounting for 99.5% of all predicted genes, among which 123,56 (68.33%) were supported by all six databases (Fig. 2).

Data Records
All raw genomic sequencing data (Illumina, PacBio, Hi-C, 10× genomics) were deposited in the NCBI Sequence Read Archive (SRA) database with accession numbers SRR25893129 49 and SRP458201 50 .Four transcriptome data from intestine, blood, gonad, and muscle were submitted to the NCBI SRA database with accession numbers SRR25683611, SRR25683610, SRR25683609, and SRR25683608, respectively, under accession the BioProject number PRJNA1006514 51 .The final chromosome assembly was deposited in the GenBank at the NCBI (JAXDRA000000000) 52 .The sequences of CDS, and protein and results of genome annotation, including repeat sequences, protein-coding regions, and ncRNA annotation, are available in figshare 53 .

technical Validation
Quality validation of sequencing data.The quality control of Illumina, 10 × genomic and transcriptome sequencing data was assessed using FastQC quality control (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).The Q20 of Illumina sequencing data was greater than 95.92%, and Q30 was greater than 89.67%.The Q20 of 10× genomic sequencing data was greater than 92.71%, and Q30 was greater than 85.77%.The Q20 of transcriptome sequencing data was larger than 97.67%, and Q30 was larger than 93.41%.The low-quality reads (<Q20) were filtered to ensure the reliability of the data used in subsequent analyses.assessment of the genome assembly.The quality of the genome assembly was assessed by four methods as follows: (i) The evaluation of the genome assembly by BUSCO v. 5.1.2 54suggested a high level of completeness (93.5%).Of 954 metazoa BUSCO genes, 92.7% were complete and single-copy, 0.8% were complete and duplicated, 4.0% were fragmented, and 2.5% were missing (Table 11).Additionally, completeness of the gene model prediction was also evaluated by BUSCO v. 5.1.2,generating a score of 91.6% (4.7% fragmented and 3.7% missing BUSCOs) (Table 12); (ii) Employing CEGMA (RRID: SCR 015055) 55 , we detected 240 (96.77%) of 248 core eukaryotic genes were detected in the genome assembly, including 221 (89.11%) complete genes.(iii) The clean short reads generated by the Illumina platform were mapped to the assembled U. unicinctus genome using BWA with parameters setting as '-o 1 -i 15' 56 .The results showed a mapping rate of 97.83% and a coverage rate of 93.90%; (iv) To evaluate the accuracy of the assembly at a single base level, variant calling with SAMTOOLS v. 0.1.19was performed 57 .A total of 8,064,289 SNPs, including 7,985,055 heterozygous SNPs and 79,234 homozygous SNPs, identified with a homozygous rate of 0.0081% (Table 13).All these results suggested the high completeness and accuracy of the U. unicinctus genome assembly.

Fig. 1
Fig. 1 Overview of U. unicinctus genome.(a) Circos plot of genomic features.From outer to inner ring: (A) gene density; (B) GC content; (C) transposable element abundance; (D) Tandem repeat density.Insert: image of adult U. unicinctus.(b) Pseudochromosomes interaction heatmap of U. unicinctus based on Hi-C assembly.Color block indicates the intensity of interaction from yellow (low) to red (high).

Fig. 2
Fig. 2 Venn diagram of functional annotation of the U. unicinctus protein-coding genes.The Venn diagram shows the shared and unique annotations among NR, KEGG, SwissProt, and InterPro.

Table 1 .
Comparative statistics of U. unicinctus genome assemblies in 2021 and 2023.

Table 2 .
The genome size estimation of U. unicinctus by k-mer distribution.

Table 3 .
The de novo assembly statistics of U. unicinctus genome.97.82% were anchored into 17 pseudo-chromosomes ranging from 41.8 Mb to 86.6 Mb in length (Fig.

Table 4 .
The statistics of anchored rate and chromosome-level scaffold lengths.

Table 6 .
The annotation of repeated sequences in U. unicinctus genome.

Table 5 .
The assembly statistics for Hi-C.

Table 7 .
Summary statistics of repeat annotation in U. unicinctus assembly.

Table 8 .
Annotation of non-coding RNA genes in U. unicinctus assembly.

Table 9 .
The statistics of predicted protein-coding genes of U. unicinctus assembly.The symbol of '*' indicated that the untranslated regions were included, and vice versa.

Table 10 .
Functional annotation of the predicted protein-coding genes in U. unicinctus.

Table 11 .
The BUSCO and CEGMA evaluation result of the genome assembly.

Table 12 .
The BUSCO score of the gene models.

Table 13 .
The statistics of SNP in U. unicinctus assembly.